I. Clean and explore data
II. Descriptive statistics
III. Histogram
IV. Scatterplot
V. Simple linear regression
VI. Advanced: interactive charts
# Load packages
# Load "ggplot2" package which is a powerful visualization package
library("ggplot2")
# Load "ggThemeAssist", an RStudio add-in that provides a graphical interface for editing ggplot2 theme elements
library("ggThemeAssist")
## Warning: package 'ggThemeAssist' was built under R version 3.6.3
# Load "plotly" which makes a static ggplot2 chart interactive with the ggplotly function
library("plotly")
## Warning: package 'plotly' was built under R version 3.6.3
# Load "dplyr" package which is a popular package for working with data frames
library("dplyr")
# Load "stargazer" that creates well-formatted tables
library("stargazer")
# Load "shiny", which builds interactive web applications from R
library("shiny")
## Warning: package 'shiny' was built under R version 3.6.3
# Specify the directory.
setwd("C:\\Users\\cupid\\Documents\\R\\AGEC317_2020Fall")
# Load a csv file
DF_PS2 <- read.csv("Instacart_demo.csv")
Note that “()” in R is used to call a function, while “[]” is used for subsetting vectors, arrays, matrices, data frames, and other such objects.
Calling “nrow()” on a data frame returns its number of rows.
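As a minimal sketch of the difference, the following uses a small made-up vector and data frame (`scores` and `toy_df` are illustrative names, not part of the Instacart data):

```r
# Illustrative objects; not part of the Instacart data
scores <- c(4, 6, 10, 17)
toy_df <- data.frame(id = 1:3, count = c(8, 17, 14))

# Parentheses call a function
n_scores <- length(scores)   # number of elements in the vector
avg_score <- mean(scores)    # arithmetic mean of the vector

# Square brackets subset an object
second_score <- scores[2]    # second element of the vector
count_col <- toy_df[, 2]     # second column of the data frame
first_row <- toy_df[1, ]     # first row of the data frame
```

Leaving the position before or after the comma empty, as in `toy_df[, 2]` or `toy_df[1, ]`, selects all rows or all columns respectively.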
nrow(DF_PS2)
## [1] 5000
We can use the “ncol()” command to obtain the number of columns.
ncol(DF_PS2)
## [1] 4
The “dim()” command tells us the dimensions of the given data frame. It outputs two numbers: the first one indicates the number of rows, and the second one indicates the number of columns.
dim(DF_PS2)
## [1] 5000 4
“names()” and “colnames()” both retrieve the names of the columns, i.e., the names of the variables.
names(DF_PS2)
## [1] "X" "order_id" "count_reorders" "count_products"
“str()” is a handy command for an overview of the structure of a data frame. It summarizes information such as the dimensions, the variable (column) names, the type of each variable, and the first few observations.
str(DF_PS2)
## 'data.frame': 5000 obs. of 4 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ order_id : int 1 170 915 1983 2442 2737 2869 3176 3785 4090 ...
## $ count_reorders: int 4 6 10 17 5 3 2 1 15 0 ...
## $ count_products: int 8 17 14 29 6 3 12 1 30 3 ...
We can display the first 6 observations of our data by using:
head(DF_PS2)
Or display the last 6 observations of our data by using:
tail(DF_PS2)
Both commands display 6 observations by default. We can also manually specify how many observations we want to display by:
head(DF_PS2, n=10)
We can also view the entire data frame as we do in Excel by using the “View()” command.
View(DF_PS2)
It’s easiest to remove variables by their position: all we need is the column index number. The following code tells R to drop the variables in the first and second columns. The minus sign drops variables.
DF_PS2_selected <- DF_PS2[,-c(1,2)]
head(DF_PS2_selected)
If we want to select only the first and second columns, we can delete the minus sign “-” in front of the letter “c.” Besides the method above, Methods 2 to 4 below yield the same output.
subset.data.frame(DF_PS2, select = -c(X, order_id) )
If we want to select only “X” and “order_id” columns, we can delete the minus sign “-” in front of the letter “c.”
# Create a character vector where we store column names which we want to drop.
drop_list <- c("X", "order_id")
# The following line tells R that we want to drop the variables specified in the "drop_list" character vector from the "DF_PS2" data frame.
DF_PS2[,!names(DF_PS2) %in% drop_list]
If we want to keep only the “X” and “order_id” columns, we can delete the negation sign “!” in the square brackets.
# Delete column by column index numbers with the "select" command
select(DF_PS2, -c(1:2))
# Delete column by column names with the "select" command
select(DF_PS2, -c("X", "order_id"))
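To check that the approaches agree, here is a minimal sketch on a made-up data frame. `toy_df` is a hypothetical stand-in that only borrows the column names of `DF_PS2`, and the dplyr `select()` variant is left out so the example runs in base R alone:

```r
# Hypothetical data frame with the same column names as DF_PS2
toy_df <- data.frame(
  X = 1:3,
  order_id = c(1, 170, 915),
  count_reorders = c(4, 6, 10),
  count_products = c(8, 17, 14)
)
drop_list <- c("X", "order_id")

# Drop the first two columns in three different ways
by_position <- toy_df[, -c(1, 2)]
by_subset   <- subset(toy_df, select = -c(X, order_id))
by_name     <- toy_df[, !names(toy_df) %in% drop_list]

# TRUE when all three results match
same_drop <- isTRUE(all.equal(by_position, by_subset)) &&
  isTRUE(all.equal(by_position, by_name))

# Removing "!" keeps the listed columns instead of dropping them
kept <- toy_df[, names(toy_df) %in% drop_list]
kept_names <- names(kept)
```

Here `same_drop` comes out `TRUE`, and `kept_names` holds `"X"` and `"order_id"`.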
summary(DF_PS2_selected)
## count_reorders count_products
## Min. : 0.000 Min. : 1.00
## 1st Qu.: 2.000 1st Qu.: 5.00
## Median : 5.000 Median : 9.00
## Mean : 6.352 Mean :10.53
## 3rd Qu.: 9.000 3rd Qu.:14.00
## Max. :46.000 Max. :64.00
By default, the “stargazer” command produces the following descriptive summary:
stargazer(DF_PS2_selected, type = "text")
##
## ==============================================================
## Statistic N Mean St. Dev. Min Pctl(25) Pctl(75) Max
## --------------------------------------------------------------
## count_reorders 5,000 6.352 5.958 0 2 9 46
## count_products 5,000 10.530 7.870 1 5 14 64
## --------------------------------------------------------------
We can flip the descriptive summary output by setting the “flip” argument to TRUE.
stargazer(DF_PS2_selected, type = "text", flip = TRUE)
##
## =======================================
## Statistic count_reorders count_products
## ---------------------------------------
## N 5,000 5,000
## Mean 6.352 10.530
## St. Dev. 5.958 7.870
## Min 0 1
## Pctl(25) 2 5
## Pctl(75) 9 14
## Max 46 64
## ---------------------------------------
ggplot(DF_PS2_selected, aes(x=count_reorders)) +
geom_histogram() +
labs(x = "Number of reordered items", y="Frequency", title="Reorders per order") +
theme(plot.title = element_text(hjust = 0.5), text = element_text(size=20))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(DF_PS2_selected, aes(x=count_products)) +
geom_histogram() +
labs(x = "Number of products", y="Frequency", title="Products per order") +
theme(plot.title = element_text(hjust = 0.5), text = element_text(size=20))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
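The `stat_bin()` message above appears because no bin width was supplied, so ggplot2 falls back to 30 bins. Since both variables are whole-number counts, one reasonable choice is a bin width of 1. The sketch below uses made-up counts (`toy_counts` is illustrative, not the Instacart data):

```r
library(ggplot2)

# Made-up item counts standing in for count_products
toy_counts <- data.frame(count_products = c(1, 3, 3, 5, 6, 8, 8, 8, 12, 17))

# binwidth = 1 gives each whole-number count its own bar
# and suppresses the stat_bin() message
hist_plot <- ggplot(toy_counts, aes(x = count_products)) +
  geom_histogram(binwidth = 1) +
  labs(x = "Number of products", y = "Frequency")
```

Printing `hist_plot` draws the chart. A larger `binwidth` smooths the shape at the cost of detail.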
ggplot(DF_PS2_selected, aes(x = count_products, y = count_reorders)) +
geom_point() +
geom_smooth(method = "lm", alpha = .15) +
labs(x = "Number of products", y="Number of reordered items", title="Reorders vs. Products per order") +
theme(plot.title = element_text(hjust = 0.5), text = element_text(size=20))
# Regression
reg <- lm(count_reorders ~ count_products, data = DF_PS2_selected)
# Output the regression result
summary(reg)
##
## Call:
## lm(formula = count_reorders ~ count_products, data = DF_PS2_selected)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.7586 -1.6474 0.3526 1.7279 16.3649
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.22666 0.07953 -2.85 0.00439 **
## count_products 0.62469 0.00605 103.26 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.366 on 4998 degrees of freedom
## Multiple R-squared: 0.6809, Adjusted R-squared: 0.6808
## F-statistic: 1.066e+04 on 1 and 4998 DF, p-value: < 2.2e-16
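As a worked example of reading the coefficients, the fitted line can be evaluated by hand. The two estimates below are copied from the output above; the order size of 10 products is an arbitrary value chosen for illustration:

```r
# Estimates copied from the summary(reg) output
intercept <- -0.22666
slope <- 0.62469

# Hypothetical order size chosen for illustration
n_products <- 10

# Predicted number of reordered items for that order
predicted_reorders <- intercept + slope * n_products
# predicted_reorders is about 6.02
```

With the fitted object available, `predict(reg, newdata = data.frame(count_products = 10))` should give the same value up to rounding of the copied estimates.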
# Report the regression result in a table format
stargazer(reg, type="text", out = "Reg_result.txt")
##
## ================================================
## Dependent variable:
## ----------------------------
## count_reorders
## ------------------------------------------------
## count_products 0.625***
## (0.006)
##
## Constant -0.227***
## (0.080)
##
## ------------------------------------------------
## Observations 5,000
## R2 0.681
## Adjusted R2 0.681
## Residual Std. Error 3.366 (df = 4998)
## F Statistic 10,662.910*** (df = 1; 4998)
## ================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
f1 <- function(df) {
Histo_reorders <- ggplot(df, aes(x=count_reorders)) +
geom_histogram() +
labs(x = "Number of reordered items", y="Frequency", title="Reorders per order") +
theme(plot.title = element_text(hjust = 0.5), text = element_text(size=20))
assign("Histo_reordersly", plotly::ggplotly(Histo_reorders), envir=parent.frame())
}
res <- f1(DF_PS2_selected)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Histo_reordersly
# Define UI for app that draws a histogram ----
# "fluidPage" creates a display that automatically adjusts to the dimensions of your user’s browser window
ui <- fluidPage(
# App title ----
titlePanel("Histogram demonstration"),
# Sidebar layout with input and output definitions ----
sidebarLayout(
# Sidebar panel for inputs ----
sidebarPanel(
# Input: Slider for the number of bins ----
sliderInput(inputId = "bins",
label = "Number of bins:",
min = 1,
max = 50,
value = 30)
),
# Main panel for displaying outputs ----
mainPanel(
# Output: Histogram ----
plotOutput(outputId = "distPlot")
)
)
)
# Define server logic required to draw a histogram ----
server <- function(input, output) {
# Histogram of the number of products per order ----
# with requested number of bins
# This expression that generates a histogram is wrapped in a call
# to renderPlot to indicate that:
#
# 1. It is "reactive" and therefore should be automatically
# re-executed when inputs (input$bins) change
# 2. Its output type is a plot
output$distPlot <- renderPlot({
theme_set(theme_bw())
ggplot(DF_PS2, aes(x=count_products)) +
geom_histogram(bins = input$bins) +
labs(x = "Number of products per order", y="Frequency", title="Histogram of products") +
theme(plot.title = element_text(hjust = 0.5), text = element_text(size=20))
})
}
# The "height" option determines how much vertical space the embedded application should occupy
shinyApp(ui = ui, server = server, options = list(height = 500))
## PhantomJS not found. You can install it with webshot::install_phantomjs(). If it is installed, please make sure the phantomjs executable can be found via the PATH variable.
f1 <- function(df) {
Scatterplot <- ggplot(df, aes(x = count_products, y = count_reorders)) +
geom_point() +
geom_smooth(method = "lm", alpha = .15) +
labs(x = "Number of products", y="Number of reordered items", title="Reorders vs. Products per order") +
theme(plot.title = element_text(hjust = 0.5), text = element_text(size=20))
assign("Scatterplotly", plotly::ggplotly(Scatterplot), envir=parent.frame())
}
res <- f1(DF_PS2_selected)
Scatterplotly